Bank marketing campaign analysis

Jose Caloca

23/4/2021

Introduction.

In recent years, machine learning has become increasingly important in the business world, as the intelligent use of data analytics is key to business success. For this project we will be using the Bank Marketing Dataset from a Portuguese bank, originally uploaded to UCI's Machine Learning Repository. It contains the results of contacts made during a direct marketing campaign offering term deposits, and our task is to analyse those results and identify strategies to improve future campaigns. A term deposit is a deposit held at a bank or financial institution at a fixed rate (often better than a simple deposit account), with the money returned at a specified maturity date.

The aim of this project is to predict whether a client will subscribe to a term deposit (variable y, yes/no), to determine the factors behind a successful marketing campaign, and to get a grasp of the features that influence the probability of subscribing to a term deposit.

For this project we will be using R and Python side by side in RStudio, via R Markdown.

Data description.

Load R packages and Python modules

We will be using the following R packages and Python modules, loaded as follows:

- R Packages:

#load R libraries
library(tidyverse)
library(DataExplorer)
library(htmltools)
library(ggstatsplot)
library(plotly)

- Python Modules:

#load Python modules
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sweetviz as sv

import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_predict, cross_val_score

from sklearn import metrics
from sklearn.metrics import (confusion_matrix, classification_report,
                             precision_recall_fscore_support, roc_curve,
                             roc_auc_score, accuracy_score,
                             recall_score, precision_score)

Load dataset

The dataset used for this project can be found by clicking here. More specifically, we will load the file bank-additional-full.csv.

To load the file we call the read_csv function from pandas. It is important to mention that there must be a folder called dataset in your main project folder containing the aforementioned file; alternatively, it can be downloaded directly from the GitHub repository.

#Import dataset
dataset = pd.read_csv("dataset/bank-additional-full.csv", sep = ";")
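The file is semicolon-separated, hence the sep = ";" argument. A minimal sketch of how pandas parses such a file, using an in-memory string with made-up values rather than the real dataset:

```python
import io

import pandas as pd

# Toy semicolon-separated sample mimicking the file's layout (not the real data)
csv_text = 'age;job;y\n30;admin.;no\n45;technician;yes\n'
df = pd.read_csv(io.StringIO(csv_text), sep=";")

print(list(df.columns))  # column names split on ";" → ['age', 'job', 'y']
print(len(df))           # two toy rows → 2
```

Without sep=";", pandas would treat each whole line as a single column.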

Before starting our analysis, we must recode the output variable to a binary class (1 and 0) instead of the "yes"/"no" strings.

dataset['y'] = dataset['y'].apply(lambda x: 0 if x =='no' else 1)
dataset.rename(columns = {"y" : "deposit"}, inplace = True)
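On a toy frame the recode and rename behave as follows (a sketch with made-up labels, not the real dataset); a vectorised equivalent would be (dataset['y'] == 'yes').astype(int):

```python
import pandas as pd

# Toy frame with made-up labels (not the real dataset)
toy = pd.DataFrame({"y": ["no", "yes", "no", "yes", "yes"]})

# Same pattern as in the text: map "no" to 0, anything else to 1, then rename
toy["y"] = toy["y"].apply(lambda x: 0 if x == "no" else 1)
toy.rename(columns={"y": "deposit"}, inplace=True)

print(toy["deposit"].tolist())  # → [0, 1, 0, 1, 1]
```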

The following chart shows a big picture of the dataset:

This dataset is 100% complete: it has no missing values and no missing columns, so no imputation techniques are needed for any of the variables. Almost half of the columns are numeric. Overall this is not a heavy dataset, as it occupies only about 6.6 MB of memory.
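The completeness and memory figures can be verified directly in pandas; on the real data the calls would be dataset.isnull().sum() and dataset.info(). A sketch on a toy frame:

```python
import pandas as pd

# Toy frame standing in for the dataset
df = pd.DataFrame({"age": [30, 45, 52],
                   "job": ["admin.", "technician", "retired"]})

missing_per_column = df.isnull().sum()        # 0 for every column here
total_missing = int(missing_per_column.sum())
mem_bytes = df.memory_usage(deep=True).sum()  # memory footprint in bytes

print(total_missing)  # → 0
```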

The dataset has 21 columns and 41188 rows. The variables have the following attributes:

Bank client data:


1. - age (numeric)

2. - job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)

3. - marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)

4. - education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)

5. - default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)

6. - housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)

7. - loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)

Related with the last contact of the current campaign:

8. - contact: contact communication type (categorical: ‘cellular’,‘telephone’)

9. - month: last contact month of year (categorical: ‘jan’ to ‘dec’)

10. - day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)

11. - duration: last contact duration, in seconds (numeric)

Other attributes:


12. - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13. - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14. - previous: number of contacts performed before this campaign and for this client (numeric)

15. - poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)

Social and economic context attributes


16. - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17. - cons.price.idx: consumer price index - monthly indicator (numeric)

18. - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19. - euribor3m: euribor 3 month rate - daily indicator (numeric)

20. - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):


21. - deposit: has the client subscribed to a term deposit? (binary: 1 = ‘yes’, 0 = ‘no’)

dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null  float64
 18  euribor3m       41188 non-null  float64
 19  nr.employed     41188 non-null  float64
 20  deposit         41188 non-null  int64  
dtypes: float64(5), int64(6), object(10)
memory usage: 6.6+ MB

From the description above we can see that for the variable pdays (number of days that passed after the client was last contacted in a previous campaign), customers who were not previously contacted have the value 999. We will therefore recode these values to zero (0).

dataset['pdays'] = dataset['pdays'].apply(lambda x: 0 if x ==999 else x)
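The same apply pattern handles the 999 sentinel; pandas' replace is an equivalent one-liner. A sketch on made-up values:

```python
import pandas as pd

# Made-up pdays values; 999 marks "never previously contacted"
pdays = pd.Series([999, 3, 999, 6, 999])

recoded = pdays.apply(lambda x: 0 if x == 999 else x)
# equivalent: pdays.replace(999, 0)

print(recoded.tolist())  # → [0, 3, 0, 6, 0]
```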

Exploratory Data Analysis

Exploratory data analysis (EDA), or descriptive statistics, is a preliminary and essential step for understanding the data we are going to work with, and is highly recommended for a sound research methodology.

The objective of this analysis is to explore, describe, summarise and visualise the nature of the data collected for the variables of interest, using simple data-summary techniques and graphical methods, without imposing prior assumptions on their interpretation.

For the EDA graphs we will use the Python library sweetviz. Sweetviz generates beautiful, high-density visualisations to kickstart EDA with just two lines of code; its output is a fully self-contained HTML application. The report is saved as an HTML file in the project folder and loaded into the R Markdown document for rendering by calling the includeHTML function from the htmltools package in R.

First, we look at some of the main summary statistics of our dataset to get a picture of the distribution of each variable.

      age                 job            marital     
 Min.   :17.00   admin.     :10422   divorced: 4612  
 1st Qu.:32.00   blue-collar: 9254   married :24928  
 Median :38.00   technician : 6743   single  :11568  
               education        default         housing           loan      
 university.degree  :12168   no     :32588   no     :18622   no     :33950  
 high.school        : 9515   unknown: 8597   unknown:  990   unknown:  990  
 basic.9y           : 6045   yes    :    3   yes    :21576   yes    : 6248  
      contact          month       day_of_week    duration     
 cellular :26144   may    :13769   fri:7827    Min.   :   0.0  
 telephone:15044   jul    : 7174   mon:8514    1st Qu.: 102.0  
                   aug    : 6178   thu:8623    Median : 180.0  
    campaign          pdays            previous            poutcome    
 Min.   : 1.000   Min.   : 0.0000   Min.   :0.000   failure    : 4252  
 1st Qu.: 1.000   1st Qu.: 0.0000   1st Qu.:0.000   nonexistent:35563  
 Median : 2.000   Median : 0.0000   Median :0.000   success    : 1373  
  emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
 Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
 1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
 Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
  nr.employed   deposit  
 Min.   :4964   0:36548  
 1st Qu.:5099   1: 4640  
 Median :5191            
 [ reached getOption("max.print") -- omitted 4 rows ]
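The table above comes from R's summary() function (truncated here by max.print); a roughly equivalent view in pandas is describe(include='all'). A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({"age": [25, 38, 41, 60],
                    "marital": ["single", "married", "married", "divorced"]})

# count/mean/quartiles for numeric columns; count/unique/top/freq for categorical ones
stats = toy.describe(include="all")

print(stats.loc["mean", "age"])     # → 41.0
print(stats.loc["top", "marital"])  # most frequent category → married
```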

From the table above we can see that most of the individuals hold admin. or technician job positions. Most of the clients (more than half) are married, and more than 50% of the customers have at least completed high school.

32588 of the customers in the campaign have not defaulted on previous financial services. A bit more than half of the customers own their house, and most of them were contacted on a cell phone rather than a fixed line. Cell-phone availability might not be a relevant distinction nowadays, but at the time of the campaign it was an important issue.

On average, the employment variation rate is close to zero (around 0.08), against an average consumer price index of 93.58.
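Claims like "more than half of the clients are married" can be checked directly with value_counts(normalize=True); a sketch on a toy column with made-up proportions:

```python
import pandas as pd

# Toy marital column: 6 married, 3 single, 1 divorced (made-up counts)
toy = pd.Series(["married"] * 6 + ["single"] * 3 + ["divorced"])

shares = toy.value_counts(normalize=True)  # category shares summing to 1

print(shares["married"])  # → 0.6
```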

Regarding the visualisation part of our EDA, we first create the report and export it to the project folder.

#EDA using sweetviz
dataset_eda = sv.analyze(dataset)

#Saving results to HTML file
dataset_eda.show_html('Exploratory_Data_Analysis.html')

Second, we load the HTML file into the R Markdown notebook interface. In the following sections we will go feature by feature to see the range of values each variable takes and how customers are distributed among them.

(The sweetviz report is rendered here. For each of the 21 features it shows the number of values, the missing and distinct counts, the top categories for categorical variables, summary statistics (min, quartiles, mean, median, max, standard deviation, skewness and kurtosis) for numeric variables, and an associations matrix combining Pearson correlations, correlation ratios and uncertainty coefficients. Some highlights: the dataset has 41188 rows and no missing values; the target deposit is heavily imbalanced, with 36548 zeros against 4640 ones; duration has the strongest association with deposit (correlation ratio 0.41); and the macroeconomic indicators emp.var.rate, euribor3m and nr.employed are almost perfectly correlated with one another.)
EDA analysis:

From the target variable we can see that 88.73% of the customers have not subscribed to the financial product offered, so we have an imbalanced dataset. This imbalance will be reflected in the train, validation and test sets when modelling; to deal with it we will use either undersampling or oversampling techniques.

From the correlation plot we can observe important correlations between several characteristics and the variable “deposit”, as well as among the characteristics themselves. The correlation matrix above was plotted with all variables. Clearly, the campaign outcome has a strong correlation with “duration”, a moderate correlation with “previous contacts”, and mild correlations with “balance”, “month of contact” and “number of campaign contacts”.

grouped_gghistostats(
    data = dataset,
    x = age,
    grouping.var = deposit, # grouping variable
    normal.curve = TRUE, # superimpose a normal distribution curve
    normal.curve.args = list(color = "red", size = 1),
    ggtheme = ggthemes::theme_tufte(),
    plotgrid.args = list(nrow = 1),
    ggstatsplot.layer = FALSE,
    ggplot.component = list(theme(text = element_text(size = 6.3))),
    annotation.args = list(title = "Age distribution by deposit")
)

In the age variable we observe that age does not differ much between the customers who took a deposit and those who did not; the average of both groups is around 40 years. Statistically speaking, however, they are two different groups, given the low p-value of the t-test. The one remarkable difference we can highlight is that most of the oldest clients subscribed to a deposit.

ggbarstats(
    data = dataset,
    x = education,
    y = deposit,
    title = "Education by deposit subscription",
    legend.title = "Educational level",
    ggtheme = hrbrthemes::theme_ipsum_pub()
)

Education shows differences between the levels. For example, clients with a university degree show an efficiency of 13.72%, while those with basic levels of studies do not reach 9% in some cases. We could say that we should aim to offer this product to clients with university, professional-course or high-school education.

In the case of type of work, retirees, students, unemployed and management positions are those who lead with the best results to offer the financial product.

Regarding marital status, we could infer that single clients are a bit more sensitive to acquire the offer of term deposits.

The month variable is a good indicator. Note that the number of contacts and their efficiency vary strongly from month to month. For example, in March we obtained 50% efficiency with very few contacts made (only 500), whereas in May 14 thousand contacts were made with an efficiency of only 6.4%.

Regarding the variable pdays we can say that most of the clients were contacted for the first time.

ggbarstats(
    data = dataset,
    x = poutcome,
    y = deposit,
    title = "Outcome of the previous marketing campaign by current deposit subscription",
    legend.title = "Previous outcome",
    ggtheme = hrbrthemes::theme_ipsum_pub()
)

Of the customers who subscribed to a deposit, only 19% had a successful result (a deposit) in the previous campaign.

As for the data pre-processing in the next section, there is no need to impute the data since we have no missing values. Regarding outliers, we can see a few in the variable “age” and we will accept them, since there are no regulations on the age at which a customer may subscribe to a term deposit. If this case study were credit-risk related, we would have to discard or treat these outliers.

Data manipulation

One-Hot Encoding of categorical variables

Most of our categorical data are variables that contain label values rather than numeric values, and the number of possible values is often limited to a fixed set. When modeling with categorical data, some algorithms can work with it directly, but most require a preliminary transformation of the variables before the modeling process.

For example, a decision tree can be trained directly from categorical data with no data transform required (this depends on the specific implementation).

Despite this, many machine learning algorithms cannot operate on label data directly: they require all input and output variables to be numeric. In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than a hard limitation of the algorithms themselves.

The main idea is to split the column that contains the categorical data into as many columns as there are categories present. Each new column contains a “1” for the rows that belong to its category and a “0” otherwise.
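As a minimal sketch of this transformation on a hypothetical toy column (the data below is illustrative, not from the bank dataset):

```python
import pandas as pd

# hypothetical toy column, purely illustrative
toy = pd.DataFrame({"contact": ["cellular", "telephone", "cellular"]})

# one new column per category, filled with 0/1 indicators
onehot = pd.get_dummies(toy, dtype=int)
print(list(onehot.columns))                 # ['contact_cellular', 'contact_telephone']
print(onehot["contact_cellular"].tolist())  # [1, 0, 1]
```

The same call applied to our categorical columns produces the job_*, marital_*, education_* indicator columns shown below.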

First: we create two data sets for numeric and non-numeric data

numerical = dataset.select_dtypes(exclude=['object'])
categorical = dataset.select_dtypes(include=['object'])

Second: One-hot encode the non-numeric columns

onehot = pd.get_dummies(categorical)

Third: Join the one-hot encoded columns to the numeric ones

df = pd.concat([numerical, onehot], axis=1)

Fourth: Print the columns in the new data set

glimpse(py$df)
Rows: 41,188
Columns: 64
$ age                           <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 2...
$ duration                      <dbl> 261, 149, 226, 151, 307, 198, 139, 21...
$ campaign                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ pdays                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ previous                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ emp.var.rate                  <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1....
$ cons.price.idx                <dbl> 93.994, 93.994, 93.994, 93.994, 93.99...
$ cons.conf.idx                 <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -3...
$ euribor3m                     <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4....
$ nr.employed                   <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5...
$ deposit                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ job_admin.                    <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0...
$ `job_blue-collar`             <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1...
$ job_entrepreneur              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ job_housemaid                 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ job_management                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ job_retired                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ `job_self-employed`           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ job_services                  <int> 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0...
$ job_student                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ job_technician                <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
$ job_unemployed                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ job_unknown                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ marital_divorced              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ marital_married               <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0...
$ marital_single                <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1...
$ marital_unknown               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ education_basic.4y            <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ education_basic.6y            <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ education_basic.9y            <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
$ education_high.school         <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1...
$ education_illiterate          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ education_professional.course <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0...
$ education_university.degree   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ education_unknown             <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0...
$ default_no                    <int> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1...
$ default_unknown               <int> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0...
$ default_yes                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ housing_no                    <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1...
$ housing_unknown               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ housing_yes                   <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0...
$ loan_no                       <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0...
$ loan_unknown                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ loan_yes                      <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1...
$ contact_cellular              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ contact_telephone             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ month_apr                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ month_aug                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ month_dec                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ month_jul                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ month_jun                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ month_mar                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ month_may                     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ month_nov                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ month_oct                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ month_sep                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ day_of_week_fri               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ day_of_week_mon               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ day_of_week_thu               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ day_of_week_tue               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ day_of_week_wed               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ poutcome_failure              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ poutcome_nonexistent          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ poutcome_success              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

With this method we end up with a larger dataframe of 64 columns.

df.shape
(41188, 64)

Creation of Training, Validation and Test datasets

In any machine learning project, a common practice after the EDA is to split the dataset into training, test and (if applicable) validation sets. We set a seed (123) for sampling reproducibility and split the one-hot encoded dataset into training, validation and test sets using pandas and numpy. The training set contains 70% of the data, the validation set 15%, and the test set the remaining 15%.

# We create the X and y data sets
X = df.loc[ : , df.columns != 'deposit']
y = df[['deposit']]

# Create training, evaluation and test sets
X_train, test_X, y_train, test_y = train_test_split(X, y, test_size=.3, random_state=123)
X_eval, X_test, y_eval, y_test = train_test_split(test_X, test_y, test_size=.5, random_state=123)

In order to check how imbalanced our training dataset is in terms of the target variable “deposit”, we run the following code to calculate the percentage of customers that did not subscribe to a term deposit in the training set.

# percentage of deposits and non-deposits in the training set
round(y_train['deposit'].value_counts()*100/len(y_train['deposit']), 2)
0    88.75
1    11.25
Name: deposit, dtype: float64

We find that 88.75% of the customers did not subscribe to a term deposit, while 11.25% got this financial product. For modeling purposes our dataset cannot remain this imbalanced, as it would bias our estimates: many algorithms assume a balanced or nearly balanced dataset. In the next section we proceed with a technique that will allow us to balance our training set.

Balancing dataset

Imbalanced data typically refers to classification problems where the classes are not represented equally. Most classification datasets do not have exactly the same number of instances in each class, and a small difference often does not matter; our dataset, however, is clearly imbalanced.

Our imbalanced dataset is not adequate for predictive modeling, as mentioned above, most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class. This results in models that have poor predictive performance, specifically for the minority class. This is a problem because typically, the minority class is more important and therefore the problem is more sensitive to classification errors for the minority class than the majority class.

To balance our dataset we will use the undersampling technique, which consists of sampling from the majority class in order to keep only a part of those points. This will reduce the number of rows of our dataset; however, we can afford to apply this method because our training set is quite large.

First we create data sets for deposits and no-deposits:

X_y_train = pd.concat([X_train.reset_index(drop = True), y_train.reset_index(drop = True)], axis = 1)
count_no_deposit, count_deposit = X_y_train['deposit'].value_counts()
no_deposit = X_y_train[X_y_train['deposit'] == 0]
deposit = X_y_train[X_y_train['deposit'] == 1]

Second we undersample the no-deposits

no_deposit_under = no_deposit.sample(count_deposit)

Third, we concatenate the undersampled no-deposits with the deposits

train_balanced = pd.concat([no_deposit_under.reset_index(drop = True), deposit.reset_index(drop = True)], axis = 0)

Lastly, we check the proportion of deposit and no deposits in our target variable:

round(train_balanced['deposit'].value_counts()*100/len(train_balanced['deposit']), 2)
1    50.0
0    50.0
Name: deposit, dtype: float64

We get a balanced training dataset with 50% of customers that subscribed to a term deposit and another 50% that did not. However, this undersampled but balanced dataset now has only 6,488 rows.

From our balanced training dataset we build the X_train feature matrix, which contains all independent variables, and the y_train target, by running the following code:

X_train = train_balanced.loc[ : , train_balanced.columns != 'deposit']
y_train = train_balanced[['deposit']]

Statistical Learning Methods

In this section we will use supervised learning algorithms in order to predict and estimate an output based on one or more inputs. In our case, we want to predict whether a customer will subscribe to a term deposit based on some input data described before.

Logistic Regression Model

Logistic regression predicts the probability of the positive class. In our case, this model will predict the probability of a customer subscribing to a term deposit.

We start by training the logistic regression model on the training data.

clf_logistic = LogisticRegression(max_iter = 100000).fit(X_train, np.ravel(y_train))

Based on the trained model, we predict the probability that a customer will subscribe to a term deposit, using the validation data.

preds = clf_logistic.predict_proba(X_eval)

The predict_proba function used in the previous chunk of code returns probabilities as floats in the range (0, 1). The first column is the probability that a customer does not take a term deposit, and the second column is the probability of subscribing to a term deposit. Now we create a dataframe with the predicted probabilities of subscribing to a term deposit and the true values of the people that subscribed:

preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval
pred_comparison = pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1)
pred_comparison.head(10)
   deposit  prob_accept_deposit
0        0             0.041983
1        0             0.036563
2        0             0.770255
3        0             0.057986
4        0             0.008412
5        0             0.417765
6        0             0.020624
7        0             0.022901
8        0             0.068139
9        1             0.179756

We are interested in the classification report of this model. For this, we reassign the probability of accepting a deposit based on the threshold 0.5, the midpoint between 0 and 1; this is a common default in many algorithms. In other words, any estimated probability higher than 0.5 is assigned to the deposit class (1), and otherwise to the no-deposit class (0).

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > 0.5 else 0)

We can roughly compare how the estimates differ from the real values by looking at the difference in the number of deposits.

Count of estimated deposits by the logistic model:

print(preds_df['prob_accept_deposit'].value_counts())
0    4779
1    1399
Name: prob_accept_deposit, dtype: int64

Count of real deposits in our validation set:

print(true_df['deposit'].value_counts())
0    5496
1     682
Name: deposit, dtype: int64

Choosing the right metric is crucial when evaluating machine learning (ML) models; different metrics are suited to different applications.

By the nature of this case study we are dealing with a classification problem. Therefore we choose Recall (also known as Sensitivity, TPR or True Positive Rate) as our metric of model performance; it is defined as the fraction of samples of a class that are correctly predicted by the model.

The Recall metric answers the question “Of all of the positive samples, what proportion did I predict correctly?”. It concentrates on the false negatives (FN), the observations that our algorithm missed: the lower the number of FN, the better the predictive power of our model. In this case study the target variable has two classes, whether a customer subscribes to a term deposit or not, so we will analyse the same metric for both classes.

\(Recall(Deposit) = \frac{True\ Positives}{True\ Positives + False\ Negatives}\)

\(Recall(No\text{-}Deposit) = \frac{True\ Negatives}{True\ Negatives + False\ Positives}\)

Another important metric that will also be analysed, but not taken into consideration when choosing the models, is Accuracy. This is perhaps the simplest metric one can imagine: the number of correct predictions divided by the total number of predictions.

\(Accuracy = \frac{True\ Positives + True\ Negatives}{True\ Positives + False\ Positives + True\ Negatives + False\ Negatives}\)

In order to check the performance of our model, we examine the classification report by running the following chunk of code:

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.99      0.86      0.92      5496
     Deposit       0.44      0.90      0.59       682

    accuracy                           0.86      6178
   macro avg       0.71      0.88      0.75      6178
weighted avg       0.92      0.86      0.88      6178

We check the accuracy score of the model as follows, although this is not our metric of interest. The value can be read from the table above or obtained by running the following chunk of code:

print(clf_logistic.score(X_eval, y_eval).round(2))
0.86

It means that this model correctly predicts 86% of the classes. Finally, we check the confusion matrix, a table with the 4 different combinations of predicted and actual values.

Where:

TN = True Negatives

TP = True Positives

FN = False Negatives

FP = False positives

# Print the confusion matrix
matrix = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix)
[[4709  787]
 [  70  612]]

TN = 4709

TP = 612

FN = 70

FP = 787
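As a quick numerical check of the recall and accuracy definitions against these counts:

```python
# counts taken from the confusion matrix printed above
tn, fp, fn, tp = 4709, 787, 70, 612

recall_deposit = tp / (tp + fn)             # recall for the "Deposit" class
recall_no_deposit = tn / (tn + fp)          # recall for the "No-deposit" class
accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy

print(round(recall_deposit, 2))     # 0.9
print(round(recall_no_deposit, 2))  # 0.86
print(round(accuracy, 2))           # 0.86
```

These values agree with the classification report shown earlier.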

We evaluate our model based on the recall metric (the true positive rate) for subscribing to a term deposit, which can also be seen in the classification report:

recall_log_reg_1 = round(matrix[1][1]/(matrix[1][1]+matrix[1][0]), 2)
print(recall_log_reg_1)
0.9

We are interested in enhancing this metric. As seen before, the cut-off point used to assign the classes from the predictions was 0.5. The cut-off point indicates whether a customer with certain characteristics will subscribe to a term deposit: if the probability is higher than the cut-off point, the customer falls in the “Deposit” class, otherwise in the “No-deposit” class.

We can, however, set an optimal threshold for the classification and improve our recall metric. Before proceeding we must reset the preds_df dataframe with the original predicted probabilities, overwriting those that resulted from the previous arbitrary cut-off.

preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit']).reset_index(drop = True)

First, we run a for loop that evaluates the model’s performance at different probability cut-off points, from 0 to 1 in increments of 0.001.

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.041983    1      1      1  ...      0      0      0      0
1             0.036563    1      1      1  ...      0      0      0      0
2             0.770255    1      1      1  ...      0      0      0      0
3             0.057986    1      1      1  ...      0      0      0      0
4             0.008412    1      1      1  ...      0      0      0      0

[5 rows x 1001 columns]

Then we calculate the metrics (accuracy and the recalls for deposit and no-deposit) for the various probability cutoffs.

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0        0.000000
0.001  0.001  0.110392          1.0        0.000000
0.002  0.002  0.110392          1.0        0.000000
0.003  0.003  0.110877          1.0        0.000546
0.004  0.004  0.111363          1.0        0.001092

Now we are able to choose the best cut-off based on the trade-off between deposit recall and no-deposit recall.

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>% gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + 
    geom_line(aes(linetype = metric)) +
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") +
    scale_color_discrete(name = "Metrics", labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

The optimal cut-off point is the following:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.541

Now we can implement the optimal threshold to the model. Again, we calculate the probability predictions from the model, then we create a dataframe with such predictions.

preds = clf_logistic.predict_proba(X_eval)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval

Then we reassign the probability of accepting a deposit based on the optimal threshold.

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

We can roughly compare how the estimates differ from the real values by looking at the difference in the number of deposits.

Count of estimated deposits by the logistic model:

print(preds_df['prob_accept_deposit'].value_counts())
0    4870
1    1308
Name: prob_accept_deposit, dtype: int64

Count of real deposits in our validation set:

print(true_df['deposit'].value_counts())
0    5496
1     682
Name: deposit, dtype: int64

For further information it is necessary to check the classification report:

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.87      0.92      5496
     Deposit       0.45      0.87      0.60       682

    accuracy                           0.87      6178
   macro avg       0.72      0.87      0.76      6178
weighted avg       0.92      0.87      0.89      6178

By setting this new cut-off, our recall metric is balanced across both classes and the model improves the correctness of classification in each of the classes.

We check the confusion matrix and compare it with the previous one:

# Print the confusion matrix
matrix_2 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_2)
[[4781  715]
 [  89  593]]

TN = 4781

TP = 593

FN = 89

FP = 715

Now we check the accuracy after assigning the values with the new cut-off point.

accuracy_log_reg_1 = round((matrix_2[0][0]+matrix_2[1][1])/sum(sum(matrix_2)), 3)
print(accuracy_log_reg_1)
0.87

There is a slight improvement in the accuracy.

We evaluate our model based on the recall metric (the true positive rate) for subscribing to a term deposit:

recall_deposit_log_reg_1 = round(matrix_2[1][1]/(matrix_2[1][1]+matrix_2[1][0]), 2)
print(recall_deposit_log_reg_1)
0.87

Now we proceed to calculate the AUC score, which stands for “Area Under the ROC Curve”. The AUC measures the entire two-dimensional area underneath the ROC curve and allows classifiers to be compared by the total area under the curves they produce. AUC ranges in value from 0 to 1: a model whose predictions are 100% wrong has an AUC of 0.0, and one whose predictions are 100% correct has an AUC of 1.0.

The Receiver Operating Characteristic (ROC) curve is a two-dimensional graph that depicts the trade-off between benefits (true positives) and costs (false positives). It displays the relation between sensitivity and specificity for a given classifier. The TPR (True Positive Rate) is plotted on the Y axis and the FPR (False Positive Rate) on the X axis, where the TPR is the ratio of true positives to the sum of true positives and false negatives, and the FPR is the ratio of false positives to the sum of false positives and true negatives. A ROC curve examines a single classifier over a set of classification thresholds.
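As a minimal sketch of how the ROC curve and the AUC are obtained with scikit-learn, on hypothetical labels and scores (not our model's output):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# hypothetical true labels and predicted probabilities, purely illustrative
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# one (FPR, TPR) point per classification threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

auc = roc_auc_score(y_true, y_score)
print(round(auc, 2))  # 0.89
```

The same two calls applied to y_eval and the predicted probabilities produce the AUC reported below.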

prob_deposit_log_reg_1 = preds[:, 1]
auc_log_reg_1 = round(roc_auc_score(y_eval, prob_deposit_log_reg_1), 3)
print(auc_log_reg_1)
0.94

Regularized Logistic Regression Model

In this section we use the same algorithm (logistic regression), but this time with regularization techniques. Regularization works by limiting the capacity of models such as logistic regression, adding a parameter-norm penalty with weight \(\lambda\) to the objective function.

Generally we trade off some bias to get lower variance, and lower-variance estimators tend to overfit less. Ridge regression (the L2-norm penalty) encodes an assumption about the function we are fitting: we are assuming that it has a small gradient. In general, when we trade off bias for lower variance, it is because we are biasing towards the kind of functions we want to fit.
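As a sketch, the L2-penalized logistic regression objective can be written as follows (note that scikit-learn parameterizes the penalty strength inversely, as \(C = 1/\lambda\), so a smaller C means stronger regularization):

\[
\min_{\beta}\; -\sum_{i=1}^{n}\Big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\Big] + \lambda \lVert \beta \rVert_2^2,
\qquad p_i = \frac{1}{1 + e^{-x_i^{\top}\beta}}
\]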

Our regularized logistic regression model uses the Stochastic Average Gradient (SAG) optimisation algorithm; we also set max_iter = 10000 (a large number) to allow the estimates to converge.

clf_logistic2 = LogisticRegression(solver='sag', max_iter = 10000, penalty = 'l2').fit(X_train, np.ravel(y_train))

As in the above section we make predictions using the evaluation dataset.

preds = clf_logistic2.predict_proba(X_eval)

These predictions are stored in a dataframe instead of an array.

preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval

Following the same approach for selecting the best cut-off point, we apply the same algorithm to find the optimal probability cut-off that balances our recall metric. Again we classify the probabilities using different cut-off points from 0 to 1, in increments of 0.001.

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.069810    1      1      1  ...      0      0      0      0
1             0.061188    1      1      1  ...      0      0      0      0
2             0.796112    1      1      1  ...      0      0      0      0
3             0.066158    1      1      1  ...      0      0      0      0
4             0.021530    1      1      1  ...      0      0      0      0

[5 rows x 1001 columns]

Then we calculate the metrics (accuracy and the recalls for deposit and no-deposit) for the various probability cutoffs.

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0        0.000000
0.001  0.001  0.110392          1.0        0.000000
0.002  0.002  0.110392          1.0        0.000000
0.003  0.003  0.110392          1.0        0.000000
0.004  0.004  0.110554          1.0        0.000182

Now we are able to choose the best cut-off based on the trade-off between deposit recall and no-deposit recall.

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>% gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + 
    geom_line(aes(linetype = metric)) +
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") +
    scale_colour_discrete(name = "Metrics", labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

The optimal cut-off point is the following:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.518
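The sweep above builds a 1000-column dataframe of classifications. As a lighter alternative sketch (on synthetic stand-in scores, not the model's actual probabilities), the same balanced cut-off can be read directly off roc_curve, since deposit recall is the TPR and no-deposit recall is 1 - FPR.

```python
# Sketch: find the cut-off where deposit recall (TPR) and no-deposit
# recall (1 - FPR) balance, using roc_curve instead of a 1000-column sweep.
# Synthetic scores stand in for the model probabilities above.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                 # stand-in for y_eval
scores = np.clip(0.3 * y_true + 0.7 * rng.random(1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
balanced = np.argmin(np.abs(tpr - (1 - fpr)))          # TPR ~= specificity
best_threshold = thresholds[balanced]
print(best_threshold)
```

This evaluates every threshold the data actually distinguishes, rather than a fixed grid of 1000 values.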

Then we reassign the probability of accepting a deposit based on the optimal threshold.

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

We can briefly compare the estimates with the real values by looking at the difference in deposit counts.

Count of estimated deposits by the logistic model:

print(preds_df['prob_accept_deposit'].value_counts())
0    4805
1    1373
Name: prob_accept_deposit, dtype: int64

Count of real deposits in our test set:

print(true_df['deposit'].value_counts())
0    5496
1     682
Name: deposit, dtype: int64

For more detail, we check the classification report:

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.86      0.91      5496
     Deposit       0.43      0.86      0.57       682

    accuracy                           0.86      6178
   macro avg       0.70      0.86      0.74      6178
weighted avg       0.92      0.86      0.88      6178

The recall metric is now more balanced, although the values are lower than in the previous model.

Next, we check the confusion matrix:

# Print the confusion matrix
matrix_3 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_3)
[[4707  789]
 [  98  584]]

Now we check the accuracy of the model after assigning the optimal probability cut-off point:

accuracy_log_reg_2 = round((matrix_3[0][0]+matrix_3[1][1])/sum(sum(matrix_3)), 3)
print(accuracy_log_reg_2)
0.856

The accuracy of this model is lower than the previous one.

We are interested in evaluating our model on recall (the true positive rate) for subscribing to a term deposit:

recall_deposit_log_reg_2 = round(matrix_3[1][1]/(matrix_3[1][1]+matrix_3[1][0]), 3)
print(recall_deposit_log_reg_2)
0.856

This recall is about 1 percentage point lower than that of the first model.
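Note that the hand-computed metrics above match sklearn's helper functions (accuracy_score and recall_score, both already imported at the top of this report); a toy sketch with made-up labels:

```python
# Sketch: hand-computed confusion-matrix metrics equal sklearn's helpers.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0])   # toy labels, not the bank data
y_pred = np.array([0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)
manual_acc = (cm[0][0] + cm[1][1]) / cm.sum()   # accuracy from the matrix
manual_rec = cm[1][1] / (cm[1][1] + cm[1][0])   # recall for the positive class
print(manual_acc, manual_rec)
```

Using the helpers directly avoids indexing mistakes when reading the matrix.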

#AUC
prob_deposit_log_reg_2 = preds[:, 1]
auc_log_reg_2 = round(roc_auc_score(y_eval, prob_deposit_log_reg_2), 3)
print(auc_log_reg_2)
0.927

So far our first model yields better results in terms of accuracy, recall and AUC. The penalised model, however, shrinks its coefficient estimates, so it should carry a lower risk of overfitting.

Reference: https://www.kaggle.com/janiobachmann/bank-marketing-campaign-opening-a-term-deposit/comments

Reduced Logistic Regression Model

Our dataset has 63 independent variables, many of which have no impact on the target variable; such variables constitute noisy data. The presence of noisy data in a dataset can significantly hamper the extraction of meaningful information, and many empirical studies have shown that noise leads to decreased classification accuracy and poor prediction results (Gupta, S. and Gupta, A., 2019).

In order to eliminate the noisy data in our training dataset, we will use the Recursive Feature Elimination (RFE) method which is a feature selection approach. It works by recursively removing attributes and building a model on those attributes that remain. It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.

Our goal is to reduce the feature set to roughly a third of its size: at most 20 variables.

logreg = LogisticRegression(max_iter = 10000)
rfe = RFE(logreg, n_features_to_select = 20)
rfe = rfe.fit(X_train, y_train)
print(list(zip(X_train.columns, rfe.support_, rfe.ranking_)))
[('age', False, 30), ('duration', False, 42), ('campaign', False, 21), ('pdays', False, 10), ('previous', True, 1), ('emp.var.rate', False, 31), ('cons.price.idx', False, 19), ('cons.conf.idx', False, 18), ('euribor3m', True, 1), ('nr.employed', False, 35), ('job_admin.', False, 13), ('job_blue-collar', True, 1), ('job_entrepreneur', False, 12), ('job_housemaid', True, 1), ('job_management', False, 16), ('job_retired', True, 1), ('job_self-employed', False, 32), ('job_services', True, 1), ('job_student', True, 1), ('job_technician', False, 23), ('job_unemployed', False, 38), ('job_unknown', False, 2), ('marital_divorced', False, 15), ('marital_married', False, 29), ('marital_single', False, 17), ('marital_unknown', False, 41), ('education_basic.4y', False, 11), ('education_basic.6y', True, 1), ('education_basic.9y', False, 39), ('education_high.school', False, 6), ('education_illiterate', False, 36), ('education_professional.course', False, 22), ('education_university.degree', False, 37), ('education_unknown', False, 28), ('default_no', False, 27), ('default_unknown', True, 1), ('default_yes', False, 44), ('housing_no', False, 7), ('housing_unknown', True, 1), ('housing_yes', False, 20), ('loan_no', False, 33), ('loan_unknown', False, 9), ('loan_yes', False, 40), ('contact_cellular', False, 5), ('contact_telephone', False, 14), ('month_apr', True, 1), ('month_aug', True, 1), ('month_dec', True, 1), ('month_jul', False, 43), ('month_jun', False, 34), ('month_mar', True, 1), ('month_may', True, 1), ('month_nov', True, 1), ('month_oct', True, 1), ('month_sep', False, 3), ('day_of_week_fri', False, 24), ('day_of_week_mon', False, 8), ('day_of_week_thu', False, 25), ('day_of_week_tue', False, 26), ('day_of_week_wed', False, 4), ('poutcome_failure', True, 1), ('poutcome_nonexistent', True, 1), ('poutcome_success', True, 1)]

The variables showing True are the ones we are interested in. We select them by running the following chunk of code:

col = X_train.columns[rfe.support_]
X_train_reduced = X_train[col]
X_eval_reduced = X_eval[col]

Now we can train our model on the training data with the 20 selected variables.

clf_logistic3 = LogisticRegression(max_iter = 100000).fit(X_train_reduced, np.ravel(y_train))

As with model 1, we make predictions and look for the optimal cut-off point for classification.

preds = clf_logistic3.predict_proba(X_eval_reduced)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval

We try to classify the probabilities using different cut-off points from 0 to 1 by increments of 0.001

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.265205    1      1      1  ...      0      0      0      0
1             0.250153    1      1      1  ...      0      0      0      0
2             0.890144    1      1      1  ...      0      0      0      0
3             0.358171    1      1      1  ...      0      0      0      0
4             0.267129    1      1      1  ...      0      0      0      0

[5 rows x 1001 columns]

We create one confusion matrix per cut-off point and calculate the accuracy and recalls for each.

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0             0.0
0.001  0.001  0.110392          1.0             0.0
0.002  0.002  0.110392          1.0             0.0
0.003  0.003  0.110392          1.0             0.0
0.004  0.004  0.110392          1.0             0.0
cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>% gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + 
    geom_line(aes(linetype = metric)) +
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") +
    scale_colour_discrete(name = "Metrics", labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

With this information we are able to choose the best cut-off point:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.427

We can proceed to classify our model with the best cut-off point:

preds = clf_logistic3.predict_proba(X_eval_reduced)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval

We apply the cut-off point in the classification step:

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

After that, we can see the results from the classification report:

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.96      0.73      0.83      5496
     Deposit       0.25      0.73      0.37       682

    accuracy                           0.73      6178
   macro avg       0.60      0.73      0.60      6178
weighted avg       0.88      0.73      0.78      6178

We analyse the confusion matrix of this model:

matrix_4 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_4)
[[3993 1503]
 [ 186  496]]

We calculate the accuracy of the model:

accuracy_log_reg_3 = round((matrix_4[0][0]+matrix_4[1][1])/sum(sum(matrix_4)), 3)
print(accuracy_log_reg_3)
0.727

Now we proceed to calculate the recall for deposits

recall_deposit_log_reg_3 = round(matrix_4[1][1]/(matrix_4[1][1]+matrix_4[1][0]), 2)
print(recall_deposit_log_reg_3)
0.73

Finally, we calculate the AUC for this model:

prob_deposit_log_reg_3 = preds[:, 1]
auc_log_reg_3 = round(roc_auc_score(y_eval, prob_deposit_log_reg_3), 3)
print(auc_log_reg_3)
0.789

Logistic regression models’ results

data = {'Model': ['Logistic Regression Model 1', 'Regularized Logistic Regression Model', 'Reduced Logistic Regression Model'], 
        'Accuracy': [accuracy_log_reg_1, accuracy_log_reg_2, accuracy_log_reg_3],
        'Recall': [recall_deposit_log_reg_1, recall_deposit_log_reg_2, recall_deposit_log_reg_3],
        'AUC': [auc_log_reg_1, auc_log_reg_2, auc_log_reg_3]
        } 
comparison = pd.DataFrame(data) 
print(comparison)
                                   Model  Accuracy  Recall    AUC
0            Logistic Regression Model 1     0.870   0.870  0.940
1  Regularized Logistic Regression Model     0.856   0.856  0.927
2      Reduced Logistic Regression Model     0.727   0.730  0.789

Gradient Boosting Trees Model

Train a model

clf_gbt = xgb.XGBClassifier(use_label_encoder=False).fit(X_train, np.ravel(y_train))
[20:26:44] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

Based on the trained model, we predict the probability that a customer subscribes to a term deposit, using the validation data.

preds = clf_gbt.predict_proba(X_eval)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval
pred_comparison = pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1)
pred_comparison.head(10)
   deposit  prob_accept_deposit
0        0             0.001469
1        0             0.005338
2        0             0.994247
3        0             0.000274
4        0             0.000237
5        0             0.561475
6        0             0.000339
7        0             0.000023
8        0             0.001409
9        1             0.010997
numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.001469    1      1      0  ...      0      0      0      0
1             0.005338    1      1      1  ...      0      0      0      0
2             0.994247    1      1      1  ...      0      0      0      0
3             0.000274    1      0      0  ...      0      0      0      0
4             0.000237    1      0      0  ...      0      0      0      0

[5 rows x 1001 columns]
cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0        0.000000
0.001  0.001  0.357073          1.0        0.277293
0.002  0.002  0.431693          1.0        0.361172
0.003  0.003  0.471512          1.0        0.405932
0.004  0.004  0.502266          1.0        0.440502
cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>% gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + 
    geom_line(aes(linetype = metric)) +
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") +
    scale_colour_discrete(name = "Metrics", labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.654
preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

After that, we can see the results from the classification report:

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.88      0.93      5496
     Deposit       0.48      0.88      0.62       682

    accuracy                           0.88      6178
   macro avg       0.73      0.88      0.78      6178
weighted avg       0.93      0.88      0.90      6178

We analyse the confusion matrix of this model:

matrix_5 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_5)
[[4852  644]
 [  80  602]]

We calculate the accuracy of the model:

accuracy_XGB_1 = round((matrix_5[0][0]+matrix_5[1][1])/sum(sum(matrix_5)), 3)
print(accuracy_XGB_1)
0.883

Now we proceed to calculate the recall for deposits

recall_XGB_1 = round(matrix_5[1][1]/(matrix_5[1][1]+matrix_5[1][0]), 2)
print(recall_XGB_1)
0.88

Finally, we calculate the AUC for this model:

prob_deposit_xgb_1 = preds[:, 1]
auc_XGB_1 = round(roc_auc_score(y_eval, prob_deposit_xgb_1), 3)
print(auc_XGB_1)
0.944

Reduced Gradient Boosting Trees Model

Create and train the model on the training data with 63 features

clf_gbt2 = xgb.XGBClassifier(use_label_encoder=False).fit(X_train,np.ravel(y_train))

Print the column importances from the model

var_importance = clf_gbt2.get_booster().get_score(importance_type = 'weight')
var_importance_df = pd.DataFrame(var_importance, index = [1])
print(var_importance)
{'duration': 784, 'nr.employed': 29, 'poutcome_success': 10, 'education_high.school': 30, 'emp.var.rate': 61, 'poutcome_failure': 27, 'age': 397, 'cons.conf.idx': 84, 'job_admin.': 41, 'day_of_week_fri': 25, 'job_housemaid': 7, 'cons.price.idx': 94, 'euribor3m': 384, 'month_oct': 17, 'pdays': 38, 'job_self-employed': 9, 'job_blue-collar': 27, 'default_no': 24, 'campaign': 164, 'previous': 62, 'housing_unknown': 4, 'education_basic.9y': 17, 'day_of_week_wed': 23, 'contact_cellular': 27, 'education_university.degree': 57, 'day_of_week_thu': 47, 'day_of_week_tue': 41, 'month_nov': 13, 'education_unknown': 8, 'job_services': 12, 'housing_yes': 36, 'day_of_week_mon': 37, 'education_professional.course': 28, 'job_student': 9, 'loan_no': 15, 'month_aug': 7, 'job_technician': 29, 'job_management': 11, 'loan_yes': 14, 'month_may': 21, 'month_apr': 8, 'month_jul': 9, 'marital_single': 36, 'housing_no': 46, 'month_mar': 17, 'marital_married': 27, 'job_entrepreneur': 4, 'job_retired': 8, 'month_sep': 9, 'education_basic.6y': 7, 'marital_divorced': 9, 'education_basic.4y': 14, 'month_jun': 5, 'month_dec': 2, 'job_unknown': 3, 'job_unemployed': 1}

Visualisation of best variables

var_importance <- py$var_importance_df
var_importance <- as.data.frame(t(var_importance))
names(var_importance)[1] <- "importance"
var_importance <- tibble::rownames_to_column(var_importance, "variables")
# make importances relative to max importance
var_importance <- var_importance[order(-var_importance$importance),]
var_importance$importance <- 100*var_importance$importance/max(var_importance$importance)


fig <- plotly::plot_ly( data = var_importance,
  x = ~importance,
  y = ~reorder(variables, importance),
  name = "Variable Importance",
  type = "bar", 
  orientation = 'h') %>% 
    plotly::layout(
        barmode = "stack",
        hovermode = "compare",
        yaxis = list(title = "Variable"),
        xaxis = list(title = "Variable Importance")
        )

fig
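As an alternative to handing the importance dict over to R, the same relative importances can be computed directly in pandas; a sketch with a small hypothetical dict in the shape returned by get_score():

```python
# Sketch: relative variable importances in pandas (hypothetical dict in the
# shape returned by get_booster().get_score(), not the full output above).
import pandas as pd

var_importance = {'duration': 784, 'age': 397, 'euribor3m': 384, 'campaign': 164}

imp = (pd.Series(var_importance, name='importance')
         .sort_values(ascending=False)
         .to_frame())
# make importances relative to the maximum, as in the R chunk above
imp['importance'] = 100 * imp['importance'] / imp['importance'].max()
print(imp)
```

XGBoost also ships a built-in xgb.plot_importance helper if a quick matplotlib chart is enough.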

Filter the X_train dataset to the best variables: we keep only the variables whose importance is greater than or equal to 10%.

var_importance_df = r.var_importance
col_names = var_importance_df.variables[var_importance_df["importance"] >= 10]

X_train_reduced = X_train[col_names]
X_eval_reduced = X_eval[col_names]

Create and train the model on the training data

clf_gbt2 = xgb.XGBClassifier(use_label_encoder=False).fit(X_train_reduced,np.ravel(y_train))

Optimal cut-off search

preds = clf_gbt2.predict_proba(X_eval_reduced)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit']).reset_index(drop = True)
true_df = y_eval
numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.003237    1      1      1  ...      0      0      0      0
1             0.003548    1      1      1  ...      0      0      0      0
2             0.947926    1      1      1  ...      0      0      0      0
3             0.001076    1      1      0  ...      0      0      0      0
4             0.000189    1      0      0  ...      0      0      0      0

[5 rows x 1001 columns]
cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0        0.000000
0.001  0.001  0.310133          1.0        0.224527
0.002  0.002  0.384752          1.0        0.308406
0.003  0.003  0.432826          1.0        0.362445
0.004  0.004  0.469893          1.0        0.404112
cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>% gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + 
    geom_line(aes(linetype = metric)) +
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") +
    scale_colour_discrete(name = "Metrics", labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.634

Predict with a model

preds = clf_gbt2.predict_proba(X_eval_reduced)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval
preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)
target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.88      0.93      5496
     Deposit       0.48      0.88      0.62       682

    accuracy                           0.88      6178
   macro avg       0.73      0.88      0.77      6178
weighted avg       0.93      0.88      0.89      6178
matrix_6 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_6)
[[4836  660]
 [  82  600]]
accuracy_XGB_2 = round((matrix_6[0][0]+matrix_6[1][1])/sum(sum(matrix_6)), 3)
print(accuracy_XGB_2)
0.88
recall_XGB_2 = round(matrix_6[1][1]/(matrix_6[1][1]+matrix_6[1][0]), 2)
print(recall_XGB_2)
0.88
prob_deposit_xgb_2 = preds[:, 1]
auc_XGB_2 = round(roc_auc_score(y_eval, prob_deposit_xgb_2), 3)
print(auc_XGB_2)
0.942

Cross Validated Gradient Boosting Trees Model

Create a gradient boosted tree model using two hyperparameters

clf_gbt3 = xgb.XGBClassifier(learning_rate = 0.1, max_depth = 7)

Calculate the cross validation scores for 10 folds

cv_scores = cross_val_score(clf_gbt3, X_train, np.ravel(y_train), cv = 10)
print(cv_scores)

Print the average accuracy and standard deviation of the scores

print("Average accuracy: %0.2f (+/- %0.2f)" % (cv_scores.mean(),
                                              cv_scores.std() * 2))
Average accuracy: 0.89 (+/- 0.02)
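cross_val_score defaults to accuracy; since the models in this report are also compared on AUC, the same call can score AUC through the scoring argument. A sketch on synthetic data (standing in for X_train and y_train):

```python
# Sketch: cross-validated AUC via the scoring argument (synthetic data
# stands in for X_train / y_train).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=10000)

auc_scores = cross_val_score(clf, X, y, cv=10, scoring='roc_auc')
print("Average AUC: %0.2f (+/- %0.2f)" % (auc_scores.mean(),
                                          auc_scores.std() * 2))
```

Any metric name accepted by scikit-learn's scoring API (e.g. 'recall', 'f1') can be substituted the same way.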

Optimal cut-off search

preds = cross_val_predict(clf_gbt3, X_eval, np.ravel(y_eval), cv=10, method = 'predict_proba')
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit']).reset_index(drop = True)
true_df = y_eval
numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.001045    1      1      0  ...      0      0      0      0
1             0.000750    1      0      0  ...      0      0      0      0
2             0.598424    1      1      1  ...      0      0      0      0
3             0.001137    1      1      0  ...      0      0      0      0
4             0.000306    1      0      0  ...      0      0      0      0

[5 rows x 1001 columns]
cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392     1.000000        0.000000
0.001  0.001  0.465847     1.000000        0.399563
0.002  0.002  0.586112     1.000000        0.534753
0.003  0.003  0.627226     1.000000        0.580968
0.004  0.004  0.651505     0.997067        0.608624
cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>% gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + 
    geom_line(aes(linetype = metric)) +
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") +
    scale_colour_discrete(name = "Metrics", labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.117

Predict with a model

preds = cross_val_predict(clf_gbt3, X_eval, np.ravel(y_eval), cv=10, method = 'predict_proba')
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit']).reset_index(drop = True)
true_df = y_eval
preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)
target_names = ['No-deposit', 'Deposit']
print(classification_report(true_df, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.88      0.93      5496
     Deposit       0.47      0.88      0.61       682

    accuracy                           0.88      6178
   macro avg       0.72      0.88      0.77      6178
weighted avg       0.93      0.88      0.89      6178
matrix_7 = confusion_matrix(true_df,preds_df['prob_accept_deposit'])
print(matrix_7)
[[4814  682]
 [  85  597]]
accuracy_XGB_3 = round((matrix_7[0][0]+matrix_7[1][1])/sum(sum(matrix_7)), 3)
print(accuracy_XGB_3)
0.876
recall_XGB_3 = round(matrix_7[1][1]/(matrix_7[1][1]+matrix_7[1][0]), 2)
print(recall_XGB_3)
0.88
prob_deposit_xgb_3 = preds[:, 1]
auc_XGB_3 = round(roc_auc_score(true_df, prob_deposit_xgb_3), 3)
print(auc_XGB_3)
0.939

Gradient Boosting Trees models’ results:

data = {'Model': ['Gradient Boosting Trees Model 1', 'Reduced Gradient Boosting Trees Model', 'Cross Validated Gradient Boosting Trees Model'], 
        'Accuracy': [accuracy_XGB_1, accuracy_XGB_2, accuracy_XGB_3],
        'Recall': [recall_XGB_1, recall_XGB_2, recall_XGB_3],
        'AUC': [auc_XGB_1, auc_XGB_2, auc_XGB_3]
        } 
comparison = pd.DataFrame(data) 
print(comparison)
                                           Model  Accuracy  Recall    AUC
0                Gradient Boosting Trees Model 1     0.883    0.88  0.944
1          Reduced Gradient Boosting Trees Model     0.880    0.88  0.942
2  Cross Validated Gradient Boosting Trees Model     0.876    0.88  0.939

Random Forest

Train a model

random_forest = RandomForestClassifier(n_estimators=128,min_samples_split=189,min_samples_leaf=7).fit(X_train, np.ravel(y_train))
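The hyperparameter values above are fixed in advance. As a hedged sketch (an illustrative search space on synthetic data, not the procedure actually used to obtain these values), such settings could be found with a randomised search:

```python
# Sketch: one way hyperparameters like those above could be chosen
# (illustrative search space and synthetic data, not the procedure actually
# used for n_estimators=128, min_samples_split=189, min_samples_leaf=7).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, random_state=0)

param_dist = {
    'n_estimators': [64, 128, 256],
    'min_samples_split': [2, 50, 189],
    'min_samples_leaf': [1, 7, 15],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

A randomised search samples n_iter combinations from the grid, which scales better than an exhaustive GridSearchCV when the space is large.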

Predict with a model

preds = random_forest.predict_proba(X_eval)

Create dataframes with predictions

preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval
pred_comparison = pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1)
pred_comparison.head(10)
   deposit  prob_accept_deposit
0        0             0.117996
1        0             0.126723
2        0             0.891273
3        0             0.105877
4        0             0.127400
5        0             0.390966
6        0             0.128073
7        0             0.136004
8        0             0.106904
9        1             0.222427

Optimal cut-off search

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.117996    1      1      1  ...      0      0      0      0
1             0.126723    1      1      1  ...      0      0      0      0
2             0.891273    1      1      1  ...      0      0      0      0
3             0.105877    1      1      1  ...      0      0      0      0
4             0.127400    1      1      1  ...      0      0      0      0

[5 rows x 1001 columns]
cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0             0.0
0.001  0.001  0.110392          1.0             0.0
0.002  0.002  0.110392          1.0             0.0
0.003  0.003  0.110392          1.0             0.0
0.004  0.004  0.110392          1.0             0.0
cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>% gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + 
    geom_line(aes(linetype = metric)) +
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") +
    scale_colour_discrete(name = "Metrics", labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.574

Predict with a model

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)
target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.86      0.92      5496
     Deposit       0.43      0.86      0.58       682

    accuracy                           0.86      6178
   macro avg       0.71      0.86      0.75      6178
weighted avg       0.92      0.86      0.88      6178
matrix_8 = confusion_matrix(y_eval, preds_df['prob_accept_deposit'])
print(matrix_8)
[[4732  764]
 [  94  588]]
accuracy_random_forest = round((matrix_8[0][0]+matrix_8[1][1])/sum(sum(matrix_8)), 3)
print(accuracy_random_forest)
0.861
recall_random_forest = round(matrix_8[1][1]/(matrix_8[1][1]+matrix_8[1][0]), 2)
print(recall_random_forest)
0.86
prob_deposit_random_forest = preds[:, 1]
auc_random_forest = round(roc_auc_score(y_eval, prob_deposit_random_forest), 3)
print(auc_random_forest)
0.934

Model Comparison

Logistic Regression Models

data = {'Model': ['Logistic Regression Model 1', 'Regularized Logistic Regression Model', 'Reduced Logistic Regression Model'], 
        'Accuracy': [accuracy_log_reg_1, accuracy_log_reg_2, accuracy_log_reg_3],
        'Recall': [recall_deposit_log_reg_1, recall_deposit_log_reg_2, recall_deposit_log_reg_3],
        'AUC': [auc_log_reg_1, auc_log_reg_2, auc_log_reg_3]
        } 
comparison = pd.DataFrame(data) 
print(comparison)
                                   Model  Accuracy  Recall    AUC
0            Logistic Regression Model 1     0.870   0.870  0.940
1  Regularized Logistic Regression Model     0.856   0.856  0.927
2      Reduced Logistic Regression Model     0.727   0.730  0.789
# ROC curve components
fallout_lr_1, sensitivity_lr_1, thresholds_lr_1 = roc_curve(y_eval, prob_deposit_log_reg_1)
fallout_lr_2, sensitivity_lr_2, thresholds_lr_2 = roc_curve(y_eval, prob_deposit_log_reg_2)
fallout_lr_3, sensitivity_lr_3, thresholds_lr_3 = roc_curve(y_eval, prob_deposit_log_reg_3)


# ROC chart for the three logistic regression models
plt.plot(fallout_lr_1, sensitivity_lr_1, color='blue', label='Logistic Regression')
plt.plot(fallout_lr_2, sensitivity_lr_2, color='red', label='Regularized Logistic Regression Model')
plt.plot(fallout_lr_3, sensitivity_lr_3, color='green', label='Reduced Logistic Regression Model')

plt.plot([0, 1], [0, 1], linestyle='--', label='Random Prediction')
plt.title("ROC Chart for all LR models on the Probability of Deposit")
plt.xlabel('Fall-out')
plt.ylabel('Sensitivity')
plt.legend()
plt.show()

plt.close()
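A note on the axes in these charts: fall-out is the false-positive rate and sensitivity is the true-positive rate, which is exactly what `roc_curve` returns. A small self-contained check on synthetic data (illustrative values only, not from this analysis) recomputes one ROC point by hand:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95])

# keep every threshold so each distinct score yields one ROC point
fallout, sensitivity, thresholds = roc_curve(y_true, scores, drop_intermediate=False)

# recompute one point manually at an arbitrary cut-off
cut = 0.5
preds = (scores >= cut).astype(int)
tp = ((preds == 1) & (y_true == 1)).sum()
fp = ((preds == 1) & (y_true == 0)).sum()
fn = ((preds == 0) & (y_true == 1)).sum()
tn = ((preds == 0) & (y_true == 0)).sum()
tpr = tp / (tp + fn)   # sensitivity (true-positive rate)
fpr = fp / (fp + tn)   # fall-out (false-positive rate)
print(fpr, tpr)        # → 0.2 0.8
```

The point (0.2, 0.8) also appears among the `roc_curve` outputs, since a cut-off of 0.5 labels the same observations positive as the 0.6 score threshold does.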

XGBoost Models

data = {'Model': ['Gradient Boosting Trees Model 1', 'Reduced Gradient Boosting Trees Model', 'Cross Validated Gradient Boosting Trees Model'], 
        'Accuracy': [accuracy_XGB_1, accuracy_XGB_2, accuracy_XGB_3],
        'Recall': [recall_XGB_1, recall_XGB_2, recall_XGB_3],
        'AUC': [auc_XGB_1, auc_XGB_2, auc_XGB_3]
        } 
comparison = pd.DataFrame(data) 
print(comparison)
                                           Model  Accuracy  Recall    AUC
0                Gradient Boosting Trees Model 1     0.883    0.88  0.944
1          Reduced Gradient Boosting Trees Model     0.880    0.88  0.942
2  Cross Validated Gradient Boosting Trees Model     0.876    0.88  0.939
fallout_xgb_1, sensitivity_xgb_1, thresholds_xgb_1 = roc_curve(y_eval, prob_deposit_xgb_1)
fallout_xgb_2, sensitivity_xgb_2, thresholds_xgb_2 = roc_curve(y_eval, prob_deposit_xgb_2)
fallout_xgb_3, sensitivity_xgb_3, thresholds_xgb_3 = roc_curve(y_eval, prob_deposit_xgb_3)


# ROC chart for the three XGBoost models
plt.plot(fallout_xgb_1, sensitivity_xgb_1, color='blue', label='XGBoost Model')
plt.plot(fallout_xgb_2, sensitivity_xgb_2, color='red', label='Reduced XGBoost Model')
plt.plot(fallout_xgb_3, sensitivity_xgb_3, color='green', label='Cross Validated XGBoost Model')

plt.plot([0, 1], [0, 1], linestyle='--', label='Random Prediction')
plt.title("ROC Chart for all XGB models on the Probability of Deposit")
plt.xlabel('Fall-out')
plt.ylabel('Sensitivity')
plt.legend()
plt.show()

plt.close()

Random Forest

data = {'Model': ['Random Forest'], 
        'Accuracy': [accuracy_random_forest],
        'Recall': [recall_random_forest],
        'AUC': [auc_random_forest]
        } 
random_forest_results = pd.DataFrame(data) 
print(random_forest_results)
           Model  Accuracy  Recall    AUC
0  Random Forest     0.861    0.86  0.934
fallout_random_forest, sensitivity_random_forest, thresholds_random_forest = roc_curve(y_eval, prob_deposit_random_forest)

# ROC chart for the random forest model
plt.plot(fallout_random_forest, sensitivity_random_forest, color='blue', label='Random Forest Model')

plt.plot([0, 1], [0, 1], linestyle='--', label='Random Prediction')
plt.title("ROC Chart for Random Forest Model on the Probability of Deposit")
plt.xlabel('Fall-out')
plt.ylabel('Sensitivity')
plt.legend()
plt.show()

plt.close()

All Models

data = {'Model': ['Logistic Regression Model 1', 
                  'Regularized Logistic Regression Model', 
                  'Reduced Logistic Regression Model', 
                  'Gradient Boosting Trees Model 1', 
                  'Reduced Gradient Boosting Trees Model', 
                  'Cross Validated Gradient Boosting Trees Model',
                  'Random Forest'], 
        'Accuracy': [accuracy_log_reg_1, 
                     accuracy_log_reg_2, 
                     accuracy_log_reg_3, 
                     accuracy_XGB_1, 
                     accuracy_XGB_2, 
                     accuracy_XGB_3,
                     accuracy_random_forest],
        'Recall': [recall_deposit_log_reg_1, 
                   recall_deposit_log_reg_2, 
                   recall_deposit_log_reg_3, 
                   recall_XGB_1, 
                   recall_XGB_2, 
                   recall_XGB_3,
                   recall_random_forest],
        'AUC': [auc_log_reg_1, 
                auc_log_reg_2, 
                auc_log_reg_3, 
                auc_XGB_1, 
                auc_XGB_2, 
                auc_XGB_3,
                auc_random_forest]
        } 


comparison = pd.DataFrame(data) 
print(comparison.sort_values(["Accuracy", "Recall", "AUC"], ascending = False))
                                           Model  Accuracy  Recall    AUC
3                Gradient Boosting Trees Model 1     0.883   0.880  0.944
4          Reduced Gradient Boosting Trees Model     0.880   0.880  0.942
5  Cross Validated Gradient Boosting Trees Model     0.876   0.880  0.939
0                    Logistic Regression Model 1     0.870   0.870  0.940
6                                  Random Forest     0.861   0.860  0.934
1          Regularized Logistic Regression Model     0.856   0.856  0.927
2              Reduced Logistic Regression Model     0.727   0.730  0.789
# ROC chart for all models
plt.plot(fallout_lr_1, sensitivity_lr_1, color='blue', label='Logistic Regression')
plt.plot(fallout_lr_2, sensitivity_lr_2, color='red', label='Regularized Logistic Regression Model')
plt.plot(fallout_lr_3, sensitivity_lr_3, color='green', label='Reduced Logistic Regression Model')
plt.plot(fallout_xgb_1, sensitivity_xgb_1, color='yellow', label='XGBoost Model')
plt.plot(fallout_xgb_2, sensitivity_xgb_2, color='blueviolet', label='Reduced XGBoost Model')
plt.plot(fallout_xgb_3, sensitivity_xgb_3, color='orange', label='Cross Validated XGBoost Model')
plt.plot(fallout_random_forest, sensitivity_random_forest, color='orchid', label='Random Forest Model')

plt.plot([0, 1], [0, 1], linestyle='--', label='Random Prediction')
plt.title("ROC Chart for All Models on the Probability of Deposit")
plt.xlabel('Fall-out')
plt.ylabel('Sensitivity')
plt.legend()
plt.show()

plt.close()